Skip to content

Conversation

@errose28
Copy link
Contributor

What changes were proposed in this pull request?

Ensure container scanners shut down cleanly without marking containers unhealthy if ongoing scans are interrupted due to shutdown.

  • Remove redundant shutdown hook in MutableVolumeSet. The datanode already has a shutdown hook set up in HddsDatanodeService#call that does a superset of this hook's operations and does them in the correct order.
    • The shutdown hook in MutableVolumeSet would close the RocksDB before stopping the scanners, causing an error in the running scanners.
  • Check interrupt flag when a scanner thread encounters an exception.
    • Various exceptions may be thrown if the thread is interrupted while doing IO. These include InterruptedIOException and ClosedByInterruptException.
    • Rather than trying to capture all exceptions that could be thrown in these cases, check the thread's interrupt flag on error, and throw InterruptedException if it is set.
      • This indicates that the scan was aborted due to interrupt and should not be marked as passed or failed.

What is the link to the Apache JIRA

HDDS-7100

How was this patch tested?

  • Unit tests for interrupt handling added.
  • Manual docker testing:
    1. Generate data: docker compose exec om ozone freon ockg --size=10000000 -n100 --type=RATIS -rTHREE
    2. Stop datanodes
    3. Inspect logs for errors on scanner shutdown.
    • Before this patch, would see log messages about scanners marking containers unhealthy because RocksDB was closed.
    • After this patch, no such errors are observed on shutdown.
  • I was unable to organically reproduce interrupted IO exceptions while reading from a container.

@errose28 errose28 requested a review from sumitagrawl June 21, 2023 23:41
@errose28
Copy link
Contributor Author

Hi @sodonnel if you have time you may be interested to review this one since you had looked at this issue originally.

Copy link
Contributor

@sumitagrawl sumitagrawl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@errose28 LGTM+1

@errose28 errose28 marked this pull request as ready for review June 22, 2023 16:53
Copy link
Contributor

@adoroszlai adoroszlai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @errose28 for the patch, LGTM.

* master: (79 commits)
  HDDS-8914. Datanode may fail to start due to duplicate VolumeInfoMetrics (apache#4966)
  HDDS-8921. Add support for EC in Freon SCM block generator (apache#4982)
  HDDS-8927. Metadata scanner should not scan unhealthy containers. (apache#4976)
  HDDS-8929. Avoid list allocation for pipeline search (apache#4980)
  HDDS-8778. Support recursive volume delete using Ozone sh command. (apache#4842)
  HDDS-8885. Quota repair count enable quota feature for old bucket/volume. (apache#4941)
  HDDS-8771. Refactor volume level tmp directory for generic usage. (apache#4838)
  HDDS-8922. Random EC read pipeline ID causes XceiverClient cache churn (apache#4971)
  HDDS-8586 Recon. - API for Count of deletePending keys and amount of data mapped to such keys. (apache#4923)
  HDDS-8908. Intermittent failure in TestBlockDeletion#testBlockDeletion (apache#4958)
  HDDS-8910. Replace LockManager with striped lock in ContainerStateManager (apache#4962)
  HDDS-8917. Move protobuf conversion out of the lock in PipelineStateManagerImpl (apache#4965)
  HDDS-8825. Use apache/hadoop 3.3.5 docker image (apache#4963)
  HDDS-8906. Avoid stream when getting in-service healthy nodes (apache#4960)
  HDDS-8907. Store volume count when storage report is updated (apache#4957)
  HDDS-8905. PipelineManager metrics should not be synchronized (apache#4959)
  HDDS-8553. Improve scanner integration tests. (apache#4936)
  HDDS-8854. Avoid unnecessary DatanodeDetails creation for NodeStateManager lookup (apache#4925)
  HDDS-8315. [Snapshot] Added unit tests for SnapshotDiffManager (apache#4716)
  HDDS-7968. [Snapshot] Improve KeyDeletingService to reclaim eligible key blocks in snapshot's deletedTable (apache#4935)
  ...
@sumitagrawl sumitagrawl merged commit b95a2ca into apache:master Jun 27, 2023
errose28 added a commit to errose28/ozone that referenced this pull request Jun 30, 2023
* master:
  HDDS-8555. [Snapshot] When snapshot feature is disabled, block OM startup if there are still snapshots in the system (apache#4994)
  HDDS-8782. Improve Volume Scanner Health checks. (apache#4867)
  HDDS-8447. Datanodes should not process container deletes for failed volumes. (apache#4901)
  HDDS-5869. Added support for stream on S3Gateway write path (apache#4970)
  HDDS-8859. [Snapshot] Return failure message to client for a failed snapshot diff jobs (apache#4993)
  HDDS-8939. [Snapshot] isBlockLocationSame check should be skipped if object is not OmKeyInfo. (apache#4991)
  HDDS-8923. Expose XceiverClient cache stats as metrics (apache#4979)
  HDDS-8913. ContainerManagerImpl: reduce processing while locked (apache#4967)
  HDDS-8935. [Snapshot] Fallback to full diff if getDetlaFiles from compaction DAG fails (apache#4986)
  HDDS-8911. Update Hadoop to 3.3.6 (apache#4985)
  HDDS-8931. Allow EC PipelineChoosingPolicy to be defined separately from Ratis (apache#4983)
  HDDS-8895. Support dynamic change of ozone.readonly.administrators in SCM (apache#4977)
  HDDS-6814. Make OM service ID optional for `ozone s3` commands if only one is defined in config (apache#4953)
  HDDS-8925. BaseFreonGenerator may not complete if last attempts fail (apache#4975)
  HDDS-7100. Container scanner incorrectly marks containers unhealthy when DN is shutdown (apache#4951)
  HDDS-8919. Allow EC pipelines to be created and then added to PipelineManager in two steps (apache#4968)
  HDDS-8901. Enable mTLS for InterSCMGrpcProtocol. (apache#4964)

Conflicts:
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/interfaces/Container.java
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/KeyValueContainer.java
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/KeyValueContainerCheck.java
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/KeyValueHandler.java
hadoop-hdds/container-service/src/test/java/org/apache/hadoop/ozone/container/common/ContainerTestUtils.java
vtutrinov pushed a commit to Cyrill/ozone that referenced this pull request Jul 3, 2023
vtutrinov pushed a commit to Cyrill/ozone that referenced this pull request Jul 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants